2024-01-30
Today's Agenda

- Using ols to fit a linear model
- summary and Predict for ols fits
- dfbeta and influential points in ols fits
The helpdat data (HELP study)

Today's main data set comes from the Health Evaluation and Linkage to Primary Care trial, and is stored as HELPrct in the mosaicData package.
HELP was a clinical trial of adult inpatients recruited from a detoxification unit. Patients with no primary care physician were randomized to receive a multidisciplinary assessment and a brief motivational intervention or usual care, with the goal of linking them to primary medical care.
| Variable | Description |
|---|---|
| id | subject identifier |
| cesd | Center for Epidemiologic Studies Depression measure (higher scores indicate more depressive symptoms) |
| age | subject age (in years) |
| sex | female (n = 107) or male (n = 346) |
| subst | primary substance of abuse (alcohol, cocaine or heroin) |
| mcs | SF-36 Mental Component Score (lower = worse status) |
| pcs | SF-36 Physical Component Score (lower = worse status) |
| pss_fr | perceived social support by friends (higher = more support) |
The helpdat data

We will look at 453 subjects with complete data today.
helpdat <- tibble(mosaicData::HELPrct) |>
select(id, cesd, age, sex, subst = substance,
mcs, pcs, pss_fr)
df_stats(~ cesd + age + mcs + pcs + pss_fr, data = helpdat) |>
gt() |> fmt_number(min:sd, decimals = 2) |>
tab_options(table.font.size = 20)

| response | min | Q1 | median | Q3 | max | mean | sd | n | missing |
|---|---|---|---|---|---|---|---|---|---|
| cesd | 1.00 | 25.00 | 34.00 | 41.00 | 60.00 | 32.85 | 12.51 | 453 | 0 |
| age | 19.00 | 30.00 | 35.00 | 40.00 | 60.00 | 35.65 | 7.71 | 453 | 0 |
| mcs | 6.76 | 21.68 | 28.60 | 40.94 | 62.18 | 31.68 | 12.84 | 453 | 0 |
| pcs | 14.07 | 40.38 | 48.88 | 56.95 | 74.81 | 48.05 | 10.78 | 453 | 0 |
| pss_fr | 0.00 | 3.00 | 7.00 | 10.00 | 14.00 | 6.71 | 4.00 | 453 | 0 |
The helpdat categorical variables

helpdat |> tabyl(sex, subst) |>
adorn_totals(where = c("row", "col")) |>
adorn_percentages(denominator = "row") |>
adorn_pct_formatting() |>
adorn_ns(position = "front") |>
adorn_title(placement = "combined") |>
gt() |> tab_options(table.font.size = 20)

| sex/subst | alcohol | cocaine | heroin | Total |
|---|---|---|---|---|
| female | 36 (33.6%) | 41 (38.3%) | 30 (28.0%) | 107 (100.0%) |
| male | 141 (40.8%) | 111 (32.1%) | 94 (27.2%) | 346 (100.0%) |
| Total | 177 (39.1%) | 152 (33.6%) | 124 (27.4%) | 453 (100.0%) |
p1 <- ggplot(helpdat, aes(sample = cesd)) +
geom_qq(col = '#440154') + geom_qq_line(col = "red") +
theme(aspect.ratio = 1) +
labs(y = "Sorted CES-D Scores",
x = "Standard Normal Distribution")
p2 <- ggplot(helpdat, aes(x = cesd)) +
geom_histogram(aes(y = after_stat(density)),
bins = 10, fill = '#440154', col = '#FDE725') +
stat_function(fun = dnorm,
args = list(mean = mean(helpdat$cesd),
sd = sd(helpdat$cesd)),
col = "red", lwd = 1.5) +
labs(y = "Density", x = "CES-D Score")
p3 <- ggplot(helpdat, aes(x = cesd, y = "")) +
geom_violin(fill = '#440154') +
geom_boxplot(width = 0.3, col = '#FDE725', notch = TRUE,
outlier.color = '#440154') +
labs(x = "CES-D Score", y = "")
p1 + (p2 / p3 + plot_layout(heights = c(4,1))) +
plot_annotation(title = "CES-D Depression Scores from helpdat data",
subtitle = "Higher CES-D indicates more depressive symptoms",
caption = "n = 453, no missing data")

helpdat$cesd : CESD at baseline
n missing distinct Info Mean Gmd .05 .10
453 0 58 0.999 32.85 14.23 10.0 15.2
.25 .50 .75 .90 .95
25.0 34.0 41.0 49.0 52.4
lowest : 1 3 4 5 6, highest: 55 56 57 58 60
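The Gmd shown above can be reproduced by brute force: it is the mean absolute difference across all pairs of subjects. A self-contained sketch with a toy vector (swapping in helpdat$cesd for x would reproduce the 14.23):

```r
# Gini's mean difference: mean absolute difference across all pairs
x <- c(1, 3, 7)                     # toy data (use helpdat$cesd in practice)
n <- length(x)
gmd <- sum(abs(outer(x, x, "-"))) / (n * (n - 1))
gmd   # pairs |1-3| = 2, |1-7| = 6, |3-7| = 4, so the mean is 4
```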
- Info measures the variable's information on a scale from 0 to 1: the higher the Info, the more continuous the variable is (the fewer ties there are).
- Gmd = Gini's mean difference, a robust measure of variation. If you randomly selected two of the 453 subjects many times, the mean difference in cesd would be 14.23 points.

The structure of helpdat

tibble [453 × 8] (S3: tbl_df/tbl/data.frame)
$ id : int [1:453] 1 2 3 4 5 6 7 8 9 10 ...
..- attr(*, "label")= chr "subject ID"
$ cesd : int [1:453] 49 30 39 15 39 6 52 32 50 46 ...
..- attr(*, "label")= chr "CESD at baseline"
$ age : int [1:453] 37 37 26 39 32 47 49 28 50 39 ...
..- attr(*, "label")= chr "age (years)"
$ sex : Factor w/ 2 levels "female","male": 2 2 2 1 2 1 1 2 1 2 ...
..- attr(*, "label")= chr "sex"
$ subst : Factor w/ 3 levels "alcohol","cocaine",..: 2 1 3 3 2 2 2 1 1 3 ...
..- attr(*, "label")= chr "primary substance of abuse"
$ mcs : num [1:453] 25.11 26.67 6.76 43.97 21.68 ...
..- attr(*, "label")= chr "SF-36 Mental Component Score"
$ pcs : num [1:453] 58.4 36 74.8 61.9 37.3 ...
..- attr(*, "label")= chr "SF-36 Physical Component Score"
$ pss_fr: int [1:453] 0 1 13 11 10 5 1 4 5 0 ...
..- attr(*, "label")= chr "perceived social support by friends"
We place the outcome (cesd) last (result on next slide).
Using ols to fit a linear regression model

The ols function stands for "ordinary least squares" and comes from the rms package, by Frank Harrell and colleagues. Any model fit with lm can also be fit with ols.
To predict var_y using var_x from the my_tibble data, we would use the following syntax:

d <- datadist(my_tibble)
options(datadist = "d")
mod <- ols(var_y ~ var_x, data = my_tibble,
           x = TRUE, y = TRUE)

This leaves a few questions…
What is the datadist stuff doing?

Before fitting an ols model to data from my_tibble, use:

d <- datadist(my_tibble)
options(datadist = "d")
"Run (the datadist code above) once before any models are fitted, storing the distribution summaries for all potential variables. Adjustment values are 0 for binary variables, the most frequent category (or optionally the first category level) for categorical (factor) variables, the middle level for ordered factor variables, and medians for continuous variables." (excerpt from the datadist documentation)
Why x = TRUE, y = TRUE?

Once we've set up the summaries with datadist, we fit a model:

mod <- ols(var_y ~ var_x, data = my_tibble,
           x = TRUE, y = TRUE)

- ols stores additional information beyond what lm does.
- x = TRUE and y = TRUE save even more expanded information for building plots and summarizing the fit.
- The defaults are x = FALSE, y = FALSE, but in 432, we'll want them saved.

Using ols to fit a model

Let's try to predict our outcome (cesd) using mcs and subst.
- Remember to set up the datadist first.
- Remember to include x = TRUE, y = TRUE.

What's in mod1?

Linear Regression Model
ols(formula = cesd ~ mcs + subst, data = helpdat, x = TRUE, y = TRUE)
Model Likelihood Discrimination
Ratio Test Indexes
Obs 453 LR chi2 295.10 R2 0.479
sigma 9.0657 d.f. 3 R2 adj 0.475
d.f. 449 Pr(> chi2) 0.0000 g 9.827
Residuals
Min 1Q Median 3Q Max
-25.43696 -6.74592 0.09334 6.16212 24.24842
Coef S.E. t Pr(>|t|)
Intercept 55.3026 1.2724 43.46 <0.0001
mcs -0.6570 0.0337 -19.48 <0.0001
subst=cocaine -3.4440 1.0055 -3.43 0.0007
subst=heroin -1.7791 1.0681 -1.67 0.0965
The Likelihood Ratio test in ols

For our mod1, LR chi2 = 295.10, d.f. = 3, Pr(> chi2) = 0.0000.

The log of the likelihood ratio, multiplied by -2, yields a test against a \(\chi^2\) distribution. Interpret this as a goodness-of-fit test that compares mod1 to a null model with only an intercept term. In ols, this is similar to the global (ANOVA) F test.
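For a Gaussian linear model, this LR statistic can be recovered directly from n and \(R^2\) (a quick arithmetic check, not part of the slide's code):

```r
# for OLS, the likelihood ratio chi-square equals -n * log(1 - R^2)
n <- 453
r2 <- 0.4787
lr <- -n * log(1 - r2)
round(lr, 1)   # matches the reported LR chi2 of 295.1 (to rounding)
```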
The g statistic in ols output

Under the \(R^2\) values, we have g = 9.827. This is Gini's mean difference for the model's predicted values: if you randomly selected two of the 453 subjects many times, the mean difference in predicted cesd will be 9.827 points. Compare this to Gini's mean difference for the observed cesd values, from describe, which was Gmd = 14.23.

Validating an ols fit

          index.orig training   test optimism index.corrected  n
R-square 0.4787 0.4874 0.4737 0.0137 0.4650 40
MSE 81.4606 79.7851 82.2361 -2.4510 83.9116 40
g 9.8272 9.9133 9.8038 0.1095 9.7177 40
Intercept 0.0000 0.0000 0.2793 -0.2793 0.2793 40
Slope 1.0000 1.0000 0.9894 0.0106 0.9894 40
| – | index.orig | training | test | optimism | index.corrected | n |
|---|---|---|---|---|---|---|
| \(R^2\) | 0.4787 | 0.4874 | 0.4737 | 0.0137 | 0.4650 | 40 |
- The index.orig for \(R^2\) is 0.4787. That's what we get from the data used to fit mod1.
- With validate, we create 40 (by default) bootstrapped resamples of the data, and then split each of those into training and test samples.
- We use each training sample to obtain \(R^2\): the mean across the 40 splits is 0.4874.
- We then apply each model to its test sample: the average \(R^2\) was 0.4737.
- optimism = training result - test result = 0.0137
- index.corrected = index.orig - optimism = 0.4650

While our nominal \(R^2\) is 0.4787, correcting for optimism yields a validated \(R^2\) of 0.4650, so we conclude that \(R^2\) = 0.4650 better estimates how mod1 will perform in new data.
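The optimism correction is simple arithmetic on the validate() table:

```r
# optimism-corrected R^2, from the validate() output shown above
index_orig <- 0.4787
training   <- 0.4874
test       <- 0.4737
optimism   <- training - test          # 0.0137
corrected  <- index_orig - optimism    # 0.4650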
ANOVA for the mod1 fit by ols

                Analysis of Variance          Response: cesd
Factor d.f. Partial SS MS F P
mcs 1 31182.7237 31182.72373 379.42 <.0001
subst 2 968.7563 484.37816 5.89 0.003
REGRESSION 3 33886.8359 11295.61195 137.44 <.0001
ERROR 449 36901.6542 82.18631
Compare this to anova() after a fit using lm(). With lm, anova() gives a sequential ANOVA table, so if we had included subst in the model first, we'd get a different SS, MS, F and p for mcs and subst, but the same REGRESSION and ERROR results.

summary for the mod1 fit by ols

             Effects              Response : cesd
Factor                    Low    High   Diff.  Effect   S.E.    Lower 0.9 Upper 0.9
mcs                       21.676 40.941 19.266 -12.6580 0.64984 -13.7290  -11.587000
subst - cocaine:alcohol    1.000  2.000     NA  -3.4440 1.00550  -5.1013   -1.786700
subst - heroin:alcohol     1.000  3.000     NA  -1.7791 1.06810  -3.5396   -0.018654
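As a check on the summary, each 90% interval is effect ± t(0.95, 449) × SE. For the cocaine vs. alcohol contrast:

```r
# reproduce the 90% CI for the cocaine vs. alcohol effect shown above
eff <- -3.4440
se  <- 1.0055
tcrit <- qt(0.95, df = 449)         # about 1.648
ci <- eff + c(-1, 1) * tcrit * se
round(ci, 4)   # close to the reported (-5.1013, -1.7867)
```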
What are the subst effects estimated by this model?

- The effect of subst being cocaine instead of alcohol on cesd is -3.44, assuming no change in mcs, with 90% CI (-5.10, -1.79).
- The effect of subst being heroin instead of alcohol on cesd is -1.78, assuming no change in mcs, with 90% CI (-3.54, -0.02).

But what about the mcs effect?
summary for the mod1 fit by ols

             Effects              Response : cesd

Factor                    Low    High   Diff.  Effect   S.E.    Lower 0.9 Upper 0.9
mcs                       21.676 40.941 19.266 -12.6580 0.64984 -13.7290  -11.587000
subst - cocaine:alcohol    1.000  2.000     NA  -3.4440 1.00550  -5.1013   -1.786700
subst - heroin:alcohol     1.000  3.000     NA  -1.7791 1.06810  -3.5396   -0.018654
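The mcs Effect above is just the mcs coefficient times the High − Low difference:

```r
# Effect for mcs = coefficient * (High - Low), using values shown above
coef_mcs <- -0.6570
effect <- coef_mcs * (40.941 - 21.676)
round(effect, 3)   # close to the reported -12.658
```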
- For mcs: -12.66 is the estimated change in cesd associated with a move from mcs = 21.68 (the Low value) to mcs = 40.94 (the High value), assuming no change in subst.
- By default, ols chooses the Low and High values from the interquartile range.
- The summary estimates the effect of mcs on cesd holding subst at its baseline (alcohol).
- It estimates the effects of subst on cesd holding mcs at its median (28.602417).

A nomogram of the ols fit

For complex models (this model isn't actually very complex) it can be helpful to have a tool that will help you see the modeled effects in terms of their impact on the predicted outcome.
A nomogram is an established graphical tool for doing this.
Reading down the nomogram's scales yields the predicted cesd for this subject.

Making predictions with the mod1 fit

Predicted cesd if mcs = 35 and subst = heroin?
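We can verify such a prediction by hand from the mod1 coefficient table (with the fitted model, predict(mod1, data.frame(mcs = 35, subst = "heroin")) gives the same answer):

```r
# predicted cesd for mcs = 35, subst = heroin, from mod1's coefficients
pred <- 55.3026 + (-0.6570 * 35) + (-1.7791)   # heroin indicator = 1
round(pred, 2)   # about 30.53
```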
- The predict function for our ols fit provides fitted values.
- The broom package can also support rms fits.

Checking the calibration of mod1

We would like our model to be well-calibrated, in the following sense…
We'd like to look at the relationship between the observed cesd outcome and our predicted cesd from the model. This is another reason we save x = TRUE, y = TRUE in ols.

The calibration plot for mod1:
n=453 Mean absolute error=0.128 Mean squared error=0.02428
0.9 Quantile of absolute error=0.267
Which points are influential in mod1?

The dfbeta value for a particular subject and coefficient \(\beta\) is the change in the coefficient that happens when the subject is excluded from the model.
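A toy illustration of that definition with base R's dfbeta() on a built-in data set (not the HELP data):

```r
# dfbeta for subject i = (coefficient from all data) - (coefficient without subject i)
fit <- lm(mpg ~ wt, data = mtcars)
b_full <- coef(fit)[["wt"]]
b_drop <- coef(lm(mpg ~ wt, data = mtcars[-17, ]))[["wt"]]
by_hand <- b_full - b_drop
# base R computes the same quantity for every subject at once:
all.equal(by_hand, unname(dfbeta(fit)[17, "wt"]))
```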
$Intercept
[1] "8" "351" "405" "433"
$mcs
[1] "351" "402" "450"
$subst
[1] "351"
which.influence lists the subjects with dfbetas that exceed the specified cutoff (the default is 0.2, but it's an arbitrary choice).

w <- which.influence(mod1, cutoff = 0.2)
d <- helpdat |> select(mcs, subst, cesd) |> data.frame()
show.influence(w, d)

    Count      mcs  subst
8 1 9.16053 alcohol
351 3 *57.48944 *heroin
402 1 *55.47938 alcohol
405 1 15.07887 alcohol
433 1 18.59431 alcohol
450 1 *62.17550 alcohol
- Use helpdat |> slice(351) to see row 351 in its entirety.
- We can also use augment (on an lm fit) to check Cook's distances.

Residual plots for mod1

To fit more complete residual plots (and other things) we can fit the lm version of this same model…
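A sketch of that lm refit (rebuilding helpdat inline so the snippet stands alone; the slides load these packages earlier):

```r
# refit mod1 with lm() to get base R's residual diagnostics
helpdat <- dplyr::select(mosaicData::HELPrct, cesd, mcs, subst = substance)
mod1_lm <- lm(cesd ~ mcs + subst, data = helpdat)
par(mfrow = c(2, 2))
plot(mod1_lm)   # residuals vs. fitted, Q-Q, scale-location, residuals vs. leverage
par(mfrow = c(1, 1))
```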
mod1_lm Residual Plots

Thinking about non-linearity

In building a linear regression model, we're most often going to be thinking about whether (and how) to model non-linear relationships between the outcome and our predictors.
A polynomial regression involves a polynomial in the variable x of degree D (a linear combination of powers of x up to D).
An orthogonal polynomial sets up a model design matrix from the powers of x and then orthogonalizes those columns so that each column is uncorrelated with the others.
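Base R's poly() illustrates the difference: its default columns are orthonormal, while the raw powers are highly correlated with one another.

```r
x <- 1:20
P_orth <- poly(x, degree = 3)               # orthogonal polynomial basis
P_raw  <- poly(x, degree = 3, raw = TRUE)   # plain powers x, x^2, x^3
max(abs(crossprod(P_orth) - diag(3)))       # essentially zero: columns orthonormal
cor(P_raw[, 1], P_raw[, 2])                 # raw powers are highly correlated
```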
Can we use pcs to predict cesd?

(scatterplot of pcs and cesd)

Polynomial fits for cesd using pcs with ols

Note that we use pol() from the rms package here to fit polynomial terms, rather than poly(), which we used in an lm fit.

Model B1 (linear in pcs)

Linear Regression Model
ols(formula = cesd ~ pcs, data = helpdat, x = TRUE, y = TRUE)
Model Likelihood Discrimination
Ratio Test Indexes
Obs 453 LR chi2 40.57 R2 0.086
sigma 11.9796 d.f. 1 R2 adj 0.084
d.f. 451 Pr(> chi2) 0.0000 g 4.177
Residuals
Min 1Q Median 3Q Max
-28.4116 -7.8036 0.6846 8.7917 29.3281
Coef S.E. t Pr(>|t|)
Intercept 49.1673 2.5728 19.11 <0.0001
pcs -0.3396 0.0522 -6.50 <0.0001
Model B2 (quadratic in pcs)

Linear Regression Model
ols(formula = cesd ~ pol(pcs, 2), data = helpdat, x = TRUE, y = TRUE)
Model Likelihood Discrimination
Ratio Test Indexes
Obs 453 LR chi2 40.68 R2 0.086
sigma 11.9915 d.f. 2 R2 adj 0.082
d.f. 450 Pr(> chi2) 0.0000 g 4.199
Residuals
Min 1Q Median 3Q Max
-28.387 -7.750 0.591 8.634 29.697
Coef S.E. t Pr(>|t|)
Intercept 46.4007 8.7967 5.27 <0.0001
pcs -0.2136 0.3867 -0.55 0.5809
pcs^2 -0.0014 0.0041 -0.33 0.7424
Model B3 (cubic in pcs)

Linear Regression Model
ols(formula = cesd ~ pol(pcs, 3), data = helpdat, x = TRUE, y = TRUE)
Model Likelihood Discrimination
Ratio Test Indexes
Obs 453 LR chi2 48.70 R2 0.102
sigma 11.8991 d.f. 3 R2 adj 0.096
d.f. 449 Pr(> chi2) 0.0000 g 4.556
Residuals
Min 1Q Median 3Q Max
-27.5245 -8.2651 0.7988 8.9004 27.4480
Coef S.E. t Pr(>|t|)
Intercept -13.4076 22.8605 -0.59 0.5578
pcs 4.1323 1.5825 2.61 0.0093
pcs^2 -0.1010 0.0354 -2.85 0.0046
pcs^3 0.0007 0.0003 2.83 0.0049
First, we need to store the fitted values. Since broom doesn't play well with ols fits, I'll just add the predictions as columns of a new data frame.
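For instance (a sketch; the model formulas match the ols fits shown earlier, and the column names fitB1 through fitB3 match the plotting code that follows):

```r
library(rms)
# rebuild the data and models inline so this sketch is self-contained
helpdat <- dplyr::select(mosaicData::HELPrct, cesd, pcs)
dd <- datadist(helpdat); options(datadist = "dd")
mod_B1 <- ols(cesd ~ pcs,         data = helpdat, x = TRUE, y = TRUE)
mod_B2 <- ols(cesd ~ pol(pcs, 2), data = helpdat, x = TRUE, y = TRUE)
mod_B3 <- ols(cesd ~ pol(pcs, 3), data = helpdat, x = TRUE, y = TRUE)
# store each model's fitted values alongside pcs and cesd
cesd_fits <- dplyr::mutate(helpdat,
                           fitB1 = predict(mod_B1),
                           fitB2 = predict(mod_B2),
                           fitB3 = predict(mod_B3))
```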
ggplot(cesd_fits, aes(x = pcs, y = cesd)) +
geom_point() +
geom_line(aes(x = pcs, y = fitB1),
col = "blue", linewidth = 1.25) +
geom_line(aes(x = pcs, y = fitB2),
col = "black", linewidth = 1.25) +
geom_line(aes(x = pcs, y = fitB3),
col = "red", linewidth = 1.25) +
geom_text(x = 18, y = 47, label = "Linear Fit",
size = 5, col = "blue") +
geom_text(x = 18, y = 39, label = "Quadratic Fit",
size = 5, col = "black") +
geom_text(x = 18, y = 26, label = "Cubic Fit",
size = 5, col = "red") +
labs(title = "Linear, Quadratic and Cubic Fits for cesd using pcs")

Restricted Cubic Splines

Restricted cubic splines can fit many different types of non-linearities. Specifying the number of knots is all you need to do in R to get a reasonable result from a restricted cubic spline.
The most common choices are 3, 4, or 5 knots.
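One way to see what a restricted cubic spline adds: with k knots it contributes k − 1 columns to the design matrix (x itself plus k − 2 non-linear terms). Hmisc's rcspline.eval() builds that basis:

```r
library(Hmisc)
x <- seq(0, 100, by = 0.5)
basis <- rcspline.eval(x, nk = 4, inclx = TRUE)  # 4 knots
ncol(basis)            # 3 columns: x plus 4 - 2 = 2 restricted cubic terms
attr(basis, "knots")   # default knot locations (quantiles of x)
```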
Fitting restricted cubic splines with ols

Let's consider a restricted cubic spline model for cesd based on pcs with:

- 3 knots in mod_C3,
- 4 knots in mod_C4, and
- 5 knots in mod_C5.

Model C3 (3-knot RCS in pcs)

Linear Regression Model
ols(formula = cesd ~ rcs(pcs, 3), data = helpdat, x = TRUE, y = TRUE)
Model Likelihood Discrimination
Ratio Test Indexes
Obs 453 LR chi2 40.79 R2 0.086
sigma 11.9901 d.f. 2 R2 adj 0.082
d.f. 450 Pr(> chi2) 0.0000 g 4.206
Residuals
Min 1Q Median 3Q Max
-28.3462 -7.7005 0.5098 8.6376 29.8454
Coef S.E. t Pr(>|t|)
Intercept 47.3631 4.7053 10.07 <0.0001
pcs -0.2908 0.1187 -2.45 0.0146
pcs' -0.0624 0.1363 -0.46 0.6471
Model C4 (4-knot RCS in pcs)

Linear Regression Model
ols(formula = cesd ~ rcs(pcs, 4), data = helpdat, x = TRUE, y = TRUE)
Model Likelihood Discrimination
Ratio Test Indexes
Obs 453 LR chi2 51.31 R2 0.107
sigma 11.8648 d.f. 3 R2 adj 0.101
d.f. 449 Pr(> chi2) 0.0000 g 4.590
Residuals
Min 1Q Median 3Q Max
-28.3147 -8.2830 0.8559 8.8866 26.5458
Coef S.E. t Pr(>|t|)
Intercept 33.3298 6.5742 5.07 <0.0001
pcs 0.1464 0.1856 0.79 0.4308
pcs' -1.4383 0.4497 -3.20 0.0015
pcs'' 6.2561 1.9076 3.28 0.0011
Model C5 (5-knot RCS in pcs)

Linear Regression Model
ols(formula = cesd ~ rcs(pcs, 5), data = helpdat, x = TRUE, y = TRUE)
Model Likelihood Discrimination
Ratio Test Indexes
Obs 453 LR chi2 54.64 R2 0.114
sigma 11.8345 d.f. 4 R2 adj 0.106
d.f. 448 Pr(> chi2) 0.0000 g 4.744
Residuals
Min 1Q Median 3Q Max
-29.396 -7.928 1.016 8.762 26.974
Coef S.E. t Pr(>|t|)
Intercept 39.0631 7.8282 4.99 <0.0001
pcs -0.0436 0.2332 -0.19 0.8517
pcs' -0.2952 1.0079 -0.29 0.7697
pcs'' -3.1835 4.8079 -0.66 0.5082
pcs''' 14.4216 8.3721 1.72 0.0857
p1 <- ggplot(Predict(mod_B1)) + ggtitle("B1: Linear")
p2 <- ggplot(Predict(mod_B2)) + ggtitle("B2: Quadratic")
p3 <- ggplot(Predict(mod_B3)) + ggtitle("B3: Cubic")
p4 <- ggplot(Predict(mod_C3)) + ggtitle("C3: 3-knot RCS")
p5 <- ggplot(Predict(mod_C4)) + ggtitle("C4: 4-knot RCS")
p6 <- ggplot(Predict(mod_C5)) + ggtitle("C5: 5-knot RCS")
(p1 + p2 + p3) / (p4 + p5 + p6)

Validating the six fits

We run set.seed(432), then validate(mod_B1), and so on for each model.

| Model | Index-Corrected \(R^2\) | Corrected MSE |
|---|---|---|
| B1 (linear) | 0.0848 | 143.25 |
| B2 (quadratic) | 0.0752 | 142.49 |
| B3 (cubic) | 0.0909 | 143.73 |
| C3 (3-knot RCS) | 0.0732 | 143.31 |
| C4 (4-knot RCS) | 0.0870 | 144.00 |
| C5 (5-knot RCS) | 0.0984 | 141.44 |
The sim_data, plotted

Models fit with lm

We'll fit:

- polynomial models (with poly()), and
- restricted cubic spline models (with rcs()),

and then use augment() for fitted values and residuals. This will help us to plot the fits for each of these six models.
The c4_im data

This is from the c4im example we used last Thursday.
Using rcs in an lm fit

For most applications, three to five knots strike a nice balance between complicating the model needlessly and fitting data pleasingly. Let's consider a restricted cubic spline model for 1000/bmi based on fruit_c again, but now with:

- 3 knots in temp3, and
- 4 knots in temp4.

The m_4int coefficients

| term | estimate | std.error | statistic | p.value | conf.low | conf.high |
|---|---|---|---|---|---|---|
| (Intercept) | 36.968 | 2.284 | 16.186 | 0.000 | 33.206 | 40.730 |
| rcs(fruit_c, 4)fruit_c | 0.708 | 1.538 | 0.460 | 0.645 | −1.825 | 3.241 |
| rcs(fruit_c, 4)fruit_c' | 0.874 | 7.075 | 0.124 | 0.902 | −10.780 | 12.528 |
| rcs(fruit_c, 4)fruit_c'' | −3.021 | 20.152 | −0.150 | 0.881 | −36.216 | 30.173 |
| exerany | 2.574 | 1.888 | 1.363 | 0.173 | −0.536 | 5.684 |
| healthVG | 1.709 | 1.978 | 0.864 | 0.388 | −1.550 | 4.968 |
| healthG | −1.927 | 1.962 | −0.982 | 0.326 | −5.159 | 1.305 |
| healthF | −6.217 | 2.157 | −2.883 | 0.004 | −9.769 | −2.664 |
| healthP | −7.629 | 3.117 | −2.448 | 0.015 | −12.762 | −2.495 |
| exerany:healthVG | −2.955 | 2.183 | −1.353 | 0.176 | −6.551 | 0.641 |
| exerany:healthG | −1.929 | 2.187 | −0.882 | 0.378 | −5.530 | 1.673 |
| exerany:healthF | 4.878 | 2.512 | 1.942 | 0.053 | 0.740 | 9.017 |
| exerany:healthP | 5.802 | 3.673 | 1.579 | 0.115 | −0.249 | 11.853 |
m_4int Residual Plots

How do m_4 and m_4int do in testing?

m4_test_aug <- augment(m_4, newdata = test_c4im) |>
mutate(bmi_fit = 1000/.fitted)
m4int_test_aug <- augment(m_4int, newdata = test_c4im) |>
mutate(bmi_fit = 1000/.fitted)
testing_r2 <- bind_rows(
rsq(m4_test_aug, truth = bmi, estimate = bmi_fit),
rsq(m4int_test_aug, truth = bmi, estimate = bmi_fit)) |>
mutate(model = c("m4", "m4int"))
testing_rmse <- bind_rows(
rmse(m4_test_aug, truth = bmi, estimate = bmi_fit),
rmse(m4int_test_aug, truth = bmi, estimate = bmi_fit)) |>
mutate(model = c("m4", "m4int"))
testing_mae <- bind_rows(
mae(m4_test_aug, truth = bmi, estimate = bmi_fit),
mae(m4int_test_aug, truth = bmi, estimate = bmi_fit)) |>
mutate(model = c("m4", "m4int"))

Comparing m_4 and m_4int in the test sample

After back-transformation of fitted values of 1000/bmi to bmi:
bind_cols(testing_r2 |> select(model, rsquare = .estimate),
testing_rmse |> select(rmse = .estimate),
testing_mae |> select(mae = .estimate)) |>
gt() |> fmt_number(columns = -model, decimals = 4) |>
tab_options(table.font.size = 20)

| model | rsquare | rmse | mae |
|---|---|---|---|
| m4 | 0.0703 | 5.6461 | 4.3076 |
| m4int | 0.0326 | 5.8818 | 4.4816 |
432 Class 05 | 2024-01-30 | https://thomaselove.github.io/432-2024/